from IPython import display
display.Image("/content/Submission form.png")
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
This analysis examines the impact of the COVID-19 pandemic on digital learning in the United States. The goal is to provide educators, policymakers, and academic researchers with insight into how digital learning unfolded during this unprecedented period. Using a rich dataset from LearnPlatform, the study focuses on engagement patterns across digital learning tools and platforms and how they relate to demographic and geographic factors.
The COVID-19 pandemic has led to a seismic shift in educational practices, affecting over 56 million students in the U.S. With the closure of educational institutions in the Spring of 2020, there was a swift pivot to remote teaching through digital platforms. This study seeks to understand the nuances of this transition, exploring the digital divide and the potential long-term learning loss among vulnerable student populations.
The analysis will involve a thorough exploration of the dataset, ensuring data integrity and addressing any missing values or anomalies. Descriptive statistics will provide a foundational understanding of the engagement trends. Advanced visualizations will be employed to uncover patterns and correlations within the data.
1. Engagement Data: Collected through LearnPlatform’s Student Chrome Extension, capturing page-load events across 10K+ educational technology products.
2. Products Info: Detailing the characteristics of the top 372 educational products used in 2020.
3. Districts Info: Providing information on the demographics and characteristics of various school districts, supplemented by NCES and FCC data.
1. State of Digital Learning in 2020: Graphical trends illustrating how digital engagement evolved during the year, especially during the COVID-19 peak from May to September 2020.
2. Comparative Analysis by Geography and Demographics: Charts comparing digital engagement across geographies, racial/ethnic groups, and socioeconomic backgrounds.
3. Correlation Analysis: Heatmaps or scatter plots to explore correlations between digital engagement and state interventions or policies.
4. Future Predictions and Trends: Projections of how digital learning might evolve post-pandemic, based on the observed data trends.
Educational Strategy: The insights will aid in shaping future digital learning strategies and policies, especially in times of crisis.
Resource Allocation: Helps in identifying areas needing more digital resources or support.
Data Reliance: There's a risk of over-reliance on quantitative data, potentially overlooking qualitative aspects of education.
Potential Biases: The dataset may have inherent biases, affecting the generalizability of the findings.
Rich Data Source: The dataset offers a broad spectrum of metrics and a comprehensive view of the digital learning landscape during the pandemic.
Diverse Perspectives: Covers a wide range of demographics and geographies, providing a holistic view of the situation.
Scope of Data: The dataset might not encapsulate all nuances of the pandemic’s impact on different student groups.
Interpretation Challenges: The subjective nature of some data points, like engagement levels, may pose challenges in interpretation.
import numpy as np
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import matplotlib
from datetime import datetime
Data Definition:
A) District information data
The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab.
# Loading dataset districts_info.csv
districts_dataframe = pd.read_csv('/content/districts_info.csv')
Q: What is the structure of districts info dataset?
| No. | Feature Name | Description of the feature |
|---|---|---|
| 01 | district_id | The unique identifier of the school district |
| 02 | state | The state in which the district resides |
| 03 | locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information. |
| 04 | pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| 05 | pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| 06 | county_connections_ratio | Ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households, based on county-level data from FCC Form 477 (December 2018 version). See FCC data for more information. |
| 07 | pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |
# get shape of dataframe
print('Shape of districts info dataset is:', districts_dataframe.shape)
# print summary of dataframe
districts_dataframe.info()
Shape of districts info dataset is: (233, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   district_id               233 non-null    int64
 1   state                     176 non-null    object
 2   locale                    176 non-null    object
 3   pct_black/hispanic        176 non-null    object
 4   pct_free/reduced          148 non-null    object
 5   county_connections_ratio  162 non-null    object
 6   pp_total_raw              118 non-null    object
dtypes: int64(1), object(6)
memory usage: 12.9+ KB
districts info dataset information:
Q: What does the data look like for the districts info dataset?
# print first 10 rows of dataframe
districts_dataframe.head(10)
| district_id | state | locale | pct_black/hispanic | pct_free/reduced | county_connections_ratio | pp_total_raw | |
|---|---|---|---|---|---|---|---|
| 0 | 8815 | Illinois | Suburb | [0, 0.2[ | [0, 0.2[ | [0.18, 1[ | [14000, 16000[ |
| 1 | 2685 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 4921 | Utah | Suburb | [0, 0.2[ | [0.2, 0.4[ | [0.18, 1[ | [6000, 8000[ |
| 3 | 3188 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2238 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 5987 | Wisconsin | Suburb | [0, 0.2[ | [0, 0.2[ | [0.18, 1[ | [10000, 12000[ |
| 6 | 3710 | Utah | Suburb | [0, 0.2[ | [0.4, 0.6[ | [0.18, 1[ | [6000, 8000[ |
| 7 | 7177 | North Carolina | Suburb | [0.2, 0.4[ | [0.2, 0.4[ | [0.18, 1[ | [8000, 10000[ |
| 8 | 9812 | Utah | Suburb | [0, 0.2[ | [0.2, 0.4[ | [0.18, 1[ | [6000, 8000[ |
| 9 | 6584 | North Carolina | Rural | [0.4, 0.6[ | [0.6, 0.8[ | [0.18, 1[ | [8000, 10000[ |
Q: What is the statistics description for districts info dataset?
# print descriptive statistics for both object and numeric type
districts_dataframe.describe(include='all')
| district_id | state | locale | pct_black/hispanic | pct_free/reduced | county_connections_ratio | pp_total_raw | |
|---|---|---|---|---|---|---|---|
| count | 233.000000 | 176 | 176 | 176 | 148 | 162 | 118 |
| unique | NaN | 23 | 4 | 5 | 5 | 2 | 11 |
| top | NaN | Connecticut | Suburb | [0, 0.2[ | [0.2, 0.4[ | [0.18, 1[ | [8000, 10000[ |
| freq | NaN | 30 | 104 | 116 | 48 | 161 | 30 |
| mean | 5219.776824 | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 2595.751581 | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 1000.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 2991.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 4937.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 7660.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 9927.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
districts info dataset data description:
Handle missing value for districts_data:
#check for missing values
districts_dataframe.isna().sum()
district_id                   0
state                        57
locale                       57
pct_black/hispanic           57
pct_free/reduced             85
county_connections_ratio     71
pp_total_raw                115
dtype: int64
Q: Are there missing values for all columns in a row?
# inspect rows where pp_total_raw is missing; these include the rows where every column except district_id is missing
districts_dataframe[districts_dataframe['pp_total_raw'].isna()]
| district_id | state | locale | pct_black/hispanic | pct_free/reduced | county_connections_ratio | pp_total_raw | |
|---|---|---|---|---|---|---|---|
| 1 | 2685 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3188 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2238 | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | 1904 | Connecticut | Rural | [0, 0.2[ | [0, 0.2[ | [0.18, 1[ | NaN |
| 17 | 8515 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 218 | 9536 | New York | City | [0.8, 1[ | [0.8, 1[ | [0.18, 1[ | NaN |
| 221 | 2201 | Connecticut | City | [0.4, 0.6[ | [0.4, 0.6[ | [0.18, 1[ | NaN |
| 222 | 9463 | NaN | NaN | NaN | NaN | NaN | NaN |
| 231 | 7975 | California | City | [0.6, 0.8[ | [0.6, 0.8[ | [0.18, 1[ | NaN |
| 232 | 7164 | California | City | [0.6, 0.8[ | [0.6, 0.8[ | [0.18, 1[ | NaN |
115 rows × 7 columns
# keep only rows with at least 6 non-null values (drops rows that are missing nearly all columns)
districts_dataframe.dropna(thresh=6, inplace=True)
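For reference, `dropna(thresh=6)` keeps only rows with at least 6 non-null values, so it also removes rows missing most (not necessarily all) columns. A toy frame with hypothetical values illustrates the semantics:

```python
import numpy as np
import pandas as pd

# toy frame: row 0 is complete, row 1 is almost empty, row 2 misses one value
toy = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [1, np.nan, 3],
    'c': [1, np.nan, 3],
    'd': [1, np.nan, 3],
    'e': [1, np.nan, 3],
    'f': [1, np.nan, np.nan],
    'g': [1, 2, 3],
})

# keep rows with at least 6 non-null values
kept = toy.dropna(thresh=6)
print(kept.index.tolist())  # [0, 2] -- row 1 (only one non-null value) is dropped
```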
#check for missing values
districts_dataframe.isna().sum()
district_id                  0
state                        0
locale                       0
pct_black/hispanic           0
pct_free/reduced            22
county_connections_ratio     4
pp_total_raw                50
dtype: int64
Note: Missing values in pct_free/reduced, county_connections_ratio, and pp_total_raw will be imputed with relevant values later in the analysis, as they require special handling.
districts info dataset data distribution:
Q: What is the percentage distribution count for locale?
districts_dataframe['locale'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1], fontsize=14, autopct='%3.1f%%',
figsize=(12, 6), shadow=True, startangle=135, legend=False, cmap='viridis')
plt.suptitle(t = 'PROPORTIONAL DISTRIBUTION OF COUNTS BY LOCALE', y = 1.05, size = 19)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = districts_dataframe['locale'].value_counts().index, loc ='lower right', frameon = True)
# Show plot
plt.show()
Q: What is the percentage distribution count for state?
# plot percentage distribution count of state
districts_dataframe['state'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0], fontsize=14, autopct='%3.1f%%',
figsize=(30,25), shadow=True, startangle=135, legend=False, cmap='tab20')
plt.suptitle(t = 'PERCENTAGE DISTRIBUTION COUNT FOR STATE', y = 1.05, size = 40)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = districts_dataframe['state'].value_counts().index, loc ='lower right', frameon = True)
plt.show()
#distribution of count for state
sns.catplot(data=districts_dataframe,y='state', kind='count', height=8, aspect=1)
plt.show()
Q: What is the distribution between state and locale?
#distribution of state and locale
sns.displot(data=districts_dataframe, y='state', hue='locale', height=7, aspect=0.8)
plt.show()
#distribution of state and locale
sns.displot(data=districts_dataframe, y='state', hue='locale', col='locale', height=15, aspect=0.6)
plt.show()
Q: What is the distribution for pct_black/hispanic?
#distribution of count for pct_black/hispanic
sns.countplot(data=districts_dataframe, x='pct_black/hispanic', facecolor=(0, 0, 0, 0),linewidth=5,edgecolor=sns.color_palette("deep", 3))
plt.show()
# split the pct_black/hispanic bucket string (e.g. "[0, 0.2[") into its two bounds
pct_black_hispanic = districts_dataframe['pct_black/hispanic'].str.split(",",n=1,expand=True)
# strip the bracket characters from the lower and upper bounds
# (the temporary column names are just labels; both values are bounds of the same bucket)
districts_dataframe['pct_black']=pct_black_hispanic[0].str.replace('[','',regex=False)
districts_dataframe['pct_hispanic']= pct_black_hispanic[1].str.replace('[','',regex=False)
# convert the bounds to numeric
districts_dataframe['pct_black']=pd.to_numeric(districts_dataframe['pct_black'])
districts_dataframe['pct_hispanic']=pd.to_numeric(districts_dataframe['pct_hispanic'])
# replace the bucket with its midpoint
districts_dataframe['pct_black/hispanic']=(districts_dataframe['pct_black'] + districts_dataframe['pct_hispanic'])/2
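As a quick sanity check, the split/strip/average steps used here reduce to this small helper (the function name is hypothetical, applied to sample bucket strings from the data preview):

```python
def bucket_midpoint(value: str) -> float:
    """Convert an NCES-style bucket such as '[0, 0.2[' to its midpoint."""
    lower, upper = value.split(",", 1)          # '[0' and ' 0.2['
    lower = float(lower.replace("[", ""))       # strip the bracket, parse lower bound
    upper = float(upper.replace("[", ""))       # strip the bracket, parse upper bound
    return (lower + upper) / 2

print(bucket_midpoint("[0, 0.2["))    # 0.1
print(bucket_midpoint("[0.4, 0.6["))  # 0.5
```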
#distribution of pct_black/hispanic with locale
sns.displot(data=districts_dataframe, x='pct_black/hispanic', hue='locale',kind='kde')
plt.show()
#distribution of pct_black/hispanic by locale
sns.boxplot(x='locale',y='pct_black/hispanic',data=districts_dataframe, palette='coolwarm')
plt.show()
# distribution of pct_black/hispanic with state
sns.displot(data=districts_dataframe, x="pct_black/hispanic", hue='state', height=9, aspect=.8)
plt.show()
Q: What is the distribution for pct_free/reduced?
#distribution of count for pct_free/reduced
sns.countplot(data=districts_dataframe, x='pct_free/reduced', facecolor=(0, 0, 0, 0),linewidth=5,edgecolor=sns.color_palette("deep", 3))
plt.show()
# split the pct_free/reduced bucket string into its two bounds
pct_free_reduced = districts_dataframe['pct_free/reduced'].str.split(",",n=1,expand=True)
# strip the bracket characters from the lower and upper bounds
districts_dataframe['pct_free']=pct_free_reduced[0].str.replace('[','',regex=False)
districts_dataframe['pct_reduced']= pct_free_reduced[1].str.replace('[','',regex=False)
# convert the bounds to numeric
districts_dataframe['pct_free']=pd.to_numeric(districts_dataframe['pct_free'])
districts_dataframe['pct_reduced']=pd.to_numeric(districts_dataframe['pct_reduced'])
# populate missing bounds with the median
districts_dataframe['pct_free'].fillna(districts_dataframe['pct_free'].median(), inplace=True)
districts_dataframe['pct_reduced'].fillna(districts_dataframe['pct_reduced'].median(), inplace=True)
# replace the bucket with its midpoint
districts_dataframe['pct_free/reduced']=(districts_dataframe['pct_free'] + districts_dataframe['pct_reduced'])/2
#distribution of pct_free/reduced with locale
sns.displot(data=districts_dataframe, x='pct_free/reduced', hue='locale',kind='kde')
plt.show()
#distribution of pct_free/reduced by locale
sns.boxplot(x='locale',y='pct_free/reduced',data=districts_dataframe, palette='coolwarm')
plt.show()
# distribution of pct_free/reduced with state
sns.displot(data=districts_dataframe, x="pct_free/reduced", hue='state', height=9, aspect=.9)
plt.show()
Q: What is the distribution for county_connections_ratio?
#distribution of count for county_connections_ratio
sns.countplot(data=districts_dataframe, x='county_connections_ratio', facecolor=(0, 0, 0, 0),linewidth=5,edgecolor=sns.color_palette("deep", 3))
plt.show()
# split the county_connections_ratio bucket string into its two bounds
county_connections_ratio = districts_dataframe['county_connections_ratio'].str.split(",",n=1,expand=True)
# strip the bracket characters from the lower and upper bounds
# (the temporary names 'county' and 'connections' are just labels for the two bounds)
districts_dataframe['county']=county_connections_ratio[0].str.replace('[','',regex=False)
districts_dataframe['connections']= county_connections_ratio[1].str.replace('[','',regex=False)
# convert the bounds to numeric
districts_dataframe['county']=pd.to_numeric(districts_dataframe['county'])
districts_dataframe['connections']=pd.to_numeric(districts_dataframe['connections'])
# populate missing bounds with the median
districts_dataframe['county'].fillna(districts_dataframe['county'].median(), inplace=True)
districts_dataframe['connections'].fillna(districts_dataframe['connections'].median(), inplace=True)
# replace the bucket with its midpoint
districts_dataframe['county_connections_ratio'] = (districts_dataframe['county'] + districts_dataframe['connections'])/2
# distribution of county connections ratio with state
sns.displot(data=districts_dataframe, x="county_connections_ratio", hue='state', height=6, aspect=.9)
plt.show()
# distribution of county connection ratio with locale
sns.displot(data=districts_dataframe, x="county_connections_ratio", hue='locale', height=6, aspect=.9)
plt.show()
#distribution of county_connections_ratio by locale
sns.boxplot(x='locale',y='county_connections_ratio',data=districts_dataframe, palette='tab20')
plt.show()
Q: What is the distribution for pp_total_raw?
#distribution of count for pp_total_raw
sns.countplot(data=districts_dataframe, y='pp_total_raw', facecolor=(0, 0, 0, 0),linewidth=4,edgecolor=sns.color_palette("deep", 3))
plt.show()
# split the pp_total_raw bucket string (e.g. "[8000, 10000[") into its two bounds
pp_total_raw = districts_dataframe['pp_total_raw'].str.split(",",n=1,expand=True)
# strip the bracket characters from the lower and upper bounds of the expenditure bucket
# (the temporary names 'pp_local' and 'pp_federal' are just labels for the two bounds)
districts_dataframe['pp_local']=pp_total_raw[0].str.replace('[','',regex=False)
districts_dataframe['pp_federal']= pp_total_raw[1].str.replace('[','',regex=False)
# convert the bounds to numeric
districts_dataframe['pp_local']=pd.to_numeric(districts_dataframe['pp_local'])
districts_dataframe['pp_federal']=pd.to_numeric(districts_dataframe['pp_federal'])
# populate missing bounds with the median
districts_dataframe['pp_local'].fillna(districts_dataframe['pp_local'].median(), inplace=True)
districts_dataframe['pp_federal'].fillna(districts_dataframe['pp_federal'].median(), inplace=True)
# replace the bucket with its midpoint
districts_dataframe['pp_total_raw']=(districts_dataframe['pp_local'] + districts_dataframe['pp_federal'])/2
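The same split/convert/impute pattern appears four times in this notebook; it could be consolidated into a single helper. The sketch below is a hypothetical refactor, not part of the original pipeline:

```python
import pandas as pd

def interval_to_midpoint(series: pd.Series) -> pd.Series:
    """Convert bucketed strings like '[8000, 10000[' to numeric midpoints,
    filling missing buckets with the median midpoint."""
    bounds = series.str.split(",", n=1, expand=True)
    lower = pd.to_numeric(bounds[0].str.replace("[", "", regex=False))
    upper = pd.to_numeric(bounds[1].str.replace("[", "", regex=False))
    midpoint = (lower + upper) / 2
    return midpoint.fillna(midpoint.median())

# example usage on a small series with a missing bucket
s = pd.Series(["[8000, 10000[", "[14000, 16000[", None])
print(interval_to_midpoint(s).tolist())  # [9000.0, 15000.0, 12000.0]
```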
# distribution of pp_total_raw expenditure with locale
sns.displot(data=districts_dataframe, x='pp_total_raw', hue='locale',kind='kde')
plt.show()
#distribution of pp_total_raw by locale
sns.boxplot(x='locale',y='pp_total_raw',data=districts_dataframe, palette='coolwarm')
plt.show()
# distribution of pp_total_raw expenditure with state
sns.displot(data=districts_dataframe, x="pp_total_raw", hue='state', height=9, aspect=.9)
plt.show()
Note: Need to remove temporary columns created for special handling.
# print summary of updated dataframe
districts_dataframe.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 164 entries, 0 to 232
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   district_id               164 non-null    int64
 1   state                     164 non-null    object
 2   locale                    164 non-null    object
 3   pct_black/hispanic        164 non-null    float64
 4   pct_free/reduced          164 non-null    float64
 5   county_connections_ratio  164 non-null    float64
 6   pp_total_raw              164 non-null    float64
 7   pct_black                 164 non-null    float64
 8   pct_hispanic              164 non-null    float64
 9   pct_free                  164 non-null    float64
 10  pct_reduced               164 non-null    float64
 11  county                    164 non-null    float64
 12  connections               164 non-null    float64
 13  pp_local                  164 non-null    float64
 14  pp_federal                164 non-null    float64
dtypes: float64(12), int64(1), object(2)
memory usage: 24.6+ KB
# drop the temporary bound columns, which are not required for further analysis
districts_dataframe = districts_dataframe.drop(columns=['pct_black','pct_hispanic','pct_free','pct_reduced','county','connections','pp_local','pp_federal'])
print("Shape of districts info in dataset is", districts_dataframe.shape)
# print summary of updated dataframe
districts_dataframe.info()
Shape of districts info in dataset is (164, 7)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 164 entries, 0 to 232
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   district_id               164 non-null    int64
 1   state                     164 non-null    object
 2   locale                    164 non-null    object
 3   pct_black/hispanic        164 non-null    float64
 4   pct_free/reduced          164 non-null    float64
 5   county_connections_ratio  164 non-null    float64
 6   pp_total_raw              164 non-null    float64
dtypes: float64(4), int64(1), object(2)
memory usage: 14.3+ KB
Q: What is the distribution and relationship for numerical variables?
#plot pairwise relationship for numeric type
sns.pairplot(data=districts_dataframe,kind='kde')
plt.show()
Q: What is the correlation for numerical variables?
# check correlation between numerical features only
corr = districts_dataframe.corr(method='pearson', numeric_only=True)
plt.figure(figsize=(11,11))
sns.heatmap(corr,vmax=.9,linewidth=.02, square = True, annot = True,cmap='coolwarm',linecolor ='Black')
plt.show()
Data Definition:
B) Product information data
The product file products_info.csv includes information about the characteristics of the top 372 products with most users in 2020.
# Loading dataset products_info.csv
products_dataframe = pd.read_csv('/content/products_info.csv')
Q: What is the structure of products info dataset?
| No. | Feature Name | Description of the feature |
|---|---|---|
| 01 | LP ID | The unique identifier of the product |
| 02 | URL | Web Link to the specific product |
| 03 | Product Name | Name of the specific product |
| 04 | Provider/Company Name | Name of the product provider |
| 05 | Sector(s) | Sector of education where the product is used |
| 06 | Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |
# get shape of dataframe
print('The Shape of products info in the dataset:', products_dataframe.shape)
# print summary of dataframe
products_dataframe.info()
The Shape of products info in the dataset: (372, 6)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 6 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   LP ID                       372 non-null    int64
 1   URL                         372 non-null    object
 2   Product Name                372 non-null    object
 3   Provider/Company Name       371 non-null    object
 4   Sector(s)                   352 non-null    object
 5   Primary Essential Function  352 non-null    object
dtypes: int64(1), object(5)
memory usage: 17.6+ KB
products info dataset information:
Q: What does the data look like for the products info dataset?
# print first 10 rows of dataframe
products_dataframe.head(10)
| LP ID | URL | Product Name | Provider/Company Name | Sector(s) | Primary Essential Function | |
|---|---|---|---|---|---|---|
| 0 | 13117 | https://www.splashmath.com | SplashLearn | StudyPad Inc. | PreK-12 | LC - Digital Learning Platforms |
| 1 | 66933 | https://abcmouse.com | ABCmouse.com | Age of Learning, Inc | PreK-12 | LC - Digital Learning Platforms |
| 2 | 50479 | https://www.abcya.com | ABCya! | ABCya.com, LLC | PreK-12 | LC - Sites, Resources & Reference - Games & Si... |
| 3 | 92993 | http://www.aleks.com/ | ALEKS | McGraw-Hill PreK-12 | PreK-12; Higher Ed | LC - Digital Learning Platforms |
| 4 | 73104 | https://www.achieve3000.com/ | Achieve3000 | Achieve3000 | PreK-12 | LC - Digital Learning Platforms |
| 5 | 37600 | http://www.activelylearn.com/ | Actively Learn | Actively Learn | PreK-12 | LC - Digital Learning Platforms |
| 6 | 18663 | http://www.adaptedmind.com | AdaptedMind | GloWorld | PreK-12 | LC - Digital Learning Platforms |
| 7 | 65131 | http://www.amplify.com/ | Amplify | Amplify Education, Inc. | PreK-12 | LC - Courseware & Textbooks |
| 8 | 26491 | http://www.answers.com/ | Answers | Answers | PreK-12; Higher Ed | LC - Study Tools - Q&A |
| 9 | 56441 | http://www.audible.com | Audible | Amazon.com, Inc. | PreK-12; Higher Ed; Corporate | LC - Sites, Resources & Reference - Streaming ... |
Q: What is the statistics description for products info dataset?
# print descriptive statistics for both object and numeric type
products_dataframe.describe(include='all')
| LP ID | URL | Product Name | Provider/Company Name | Sector(s) | Primary Essential Function | |
|---|---|---|---|---|---|---|
| count | 372.000000 | 372 | 372 | 371 | 352 | 352 |
| unique | NaN | 372 | 372 | 290 | 5 | 35 |
| top | NaN | https://www.splashmath.com | SplashLearn | Google LLC | PreK-12 | LC - Digital Learning Platforms |
| freq | NaN | 1 | 1 | 30 | 170 | 74 |
| mean | 54565.795699 | NaN | NaN | NaN | NaN | NaN |
| std | 26247.551437 | NaN | NaN | NaN | NaN | NaN |
| min | 10533.000000 | NaN | NaN | NaN | NaN | NaN |
| 25% | 30451.000000 | NaN | NaN | NaN | NaN | NaN |
| 50% | 53942.500000 | NaN | NaN | NaN | NaN | NaN |
| 75% | 77497.000000 | NaN | NaN | NaN | NaN | NaN |
| max | 99916.000000 | NaN | NaN | NaN | NaN | NaN |
products info dataset data description:
Handle missing value for products_data:
#check for missing values
products_dataframe.isna().sum()
LP ID                          0
URL                            0
Product Name                   0
Provider/Company Name          1
Sector(s)                     20
Primary Essential Function    20
dtype: int64
# mark the single missing Provider/Company Name as "Missing", since it is a one-off value
products_dataframe['Provider/Company Name'].fillna("Missing", inplace=True)
# populate missing values with the mode value 'PreK-12'
sector_mode= products_dataframe['Sector(s)'].mode()
products_dataframe['Sector(s)'].fillna(value=sector_mode[0], inplace=True)
# populate missing value with mode
essential_mode= products_dataframe['Primary Essential Function'].mode()
products_dataframe['Primary Essential Function'].fillna(value=essential_mode[0], inplace=True)
# rename LP ID column to lp_id for it to match as in engagement_data
products_dataframe.rename(columns={'LP ID':'lp_id'}, inplace=True)
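With the key renamed, product metadata can later be attached to engagement rows via a join on `lp_id`. A minimal sketch with toy stand-in frames (values taken from the previews above):

```python
import pandas as pd

# toy stand-ins for the two frames, keyed on lp_id
products = pd.DataFrame({'lp_id': [13117, 66933],
                         'Product Name': ['SplashLearn', 'ABCmouse.com']})
engagement = pd.DataFrame({'lp_id': [13117, 13117, 66933],
                           'pct_access': [0.49, 0.51, 0.04]})

# left join keeps every engagement row and attaches product metadata
merged = engagement.merge(products, on='lp_id', how='left')
print(merged['Product Name'].tolist())  # ['SplashLearn', 'SplashLearn', 'ABCmouse.com']
```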
products info dataset data distribution:
Q: What is the distribution count for Sector(s) feature?
# plot percentage distribution of Sector(s)
products_dataframe['Sector(s)'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0], fontsize=15, autopct='%3.1f%%',
figsize=(21,11), shadow=True, startangle=135, legend=False, cmap='Accent')
plt.title('PERCENTAGE DISTRIBUTION OF SECTOR TYPE', y = 1.02, size = 32)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = products_dataframe['Sector(s)'].value_counts().index, loc ='lower right', frameon = True)
plt.show()
# distribution of Sectors
sns.displot(data=products_dataframe, x='Sector(s)', height=6, aspect=3)
plt.show()
Q: What is the distribution count for Primary Essential Function feature?
#distribution of count for Primary Essential Function
sns.catplot(data=products_dataframe,y='Primary Essential Function', kind='count', height=8, aspect=2)
plt.show()
#distribution of Primary Essential Function and Sector(s)
sns.displot(data=products_dataframe, y='Primary Essential Function', hue='Sector(s)', height=8, aspect=2)
plt.show()
#check distribution of Primary Essential Function - LC Level
products_LC=products_dataframe[products_dataframe['Primary Essential Function'].str.contains("LC - ")]
products_LC['Primary Essential Function'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0], fontsize=14, autopct='%3.1f%%',
figsize=(39,34), shadow=True, startangle=136, legend=False, cmap='coolwarm')
plt.suptitle(t = 'PRIMARY ESSENTIAL FUNCTION - LC Level', y = 1.05, size = 45)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = products_LC['Primary Essential Function'].value_counts().index, loc ='lower right', frameon = True)
plt.show()
#check distribution of Primary Essential Function - CM Level
products_CM=products_dataframe[products_dataframe['Primary Essential Function'].str.contains("CM - ")]
products_CM['Primary Essential Function'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0], fontsize=15, autopct='%3.1f%%',
figsize=(22,11), shadow=True, startangle=136, legend=False, cmap='rainbow')
plt.suptitle(t = 'PRIMARY ESSENTIAL FUNCTION - CM Level', y = 1.04, size = 32)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = products_CM['Primary Essential Function'].value_counts().index, loc ='lower right', frameon = True)
plt.show()
#check distribution of Primary Essential Function - SDO Level
products_SDO=products_dataframe[products_dataframe['Primary Essential Function'].str.contains("SDO - ")]
products_SDO['Primary Essential Function'].value_counts().plot(kind='pie', explode=[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0], fontsize=14, autopct='%3.1f%%',
figsize=(32,11), shadow=True, startangle=134, legend=False, cmap='prism')
plt.suptitle(t = 'Primary Essential Function - SDO Level', y = 1.05, size = 38)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = products_SDO['Primary Essential Function'].value_counts().index, loc ='lower right', frameon = True)
plt.show()
Q: What is the text distribution view for Product Name feature?
# get unique Product Name
products_name= products_dataframe['Product Name'].unique()
# convert numpy array to string
products_name_str = ",".join(products_name)
# create WordCloud with converted string
wordcloud = WordCloud(width = 1050, height = 550, random_state=1, background_color='skyblue', collocations=False).generate(products_name_str)
plt.figure(figsize=(16, 16))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Q: What is the text distribution view for Provider/Company Name feature?
# get unique Provider/Company Name
products_provider= products_dataframe['Provider/Company Name'].unique()
# convert numpy array to string
products_provider_str = ",".join(products_provider)
# create WordCloud with converted string
wordcloud = WordCloud(width = 1050, height = 550, random_state=1, background_color='skyblue', collocations=True).generate(products_provider_str)
plt.figure(figsize=(16, 16))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Data Definition:
C) Engagement data
The engagement data are aggregated at school district level, and each file in the folder engagement_data represents data from one school district.
#Loading CSV files for engagement_data
path = '/content/drive/MyDrive/engagement_data'
print(path)
csv_files = glob.glob(os.path.join(path, "*.csv"))
data = []
# loop over the list of csv files
for f in csv_files:
# read the csv file
frame = pd.read_csv(f)
# split filename and extension
filename = os.path.splitext(f)
# extract district ID from CSV filename tuple
frame['district_id'] = os.path.basename(filename[0])
# add new column to new list
data.append(frame)
# create dataframe for engagement data
engagement_dataframe = pd.concat(data)
/content/drive/MyDrive/engagement_data
Q: What is the structure of engagement dataset?
Note: 4-digit filename value was added as new column "district_id" in engagement_data
| No. | Feature Name | Description of the feature |
|---|---|---|
| 01 | time | date in "YYYY-MM-DD" |
| 02 | lp_id | The unique identifier of the product |
| 03 | pct_access | Percentage of students in the district with at least one page-load event for a given product on a given day |
| 04 | engagement_index | Total page-load events per one thousand students for a given product on a given day |
| 05 | district_id | Extracted from filename |
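These two metrics arrive pre-aggregated in the dataset; the following sketch only illustrates their definitions using hypothetical raw counts:

```python
# hypothetical raw counts for one product on one day; the dataset ships
# these metrics pre-aggregated, so this only illustrates the definitions
n_students = 2000          # students in the district
students_with_load = 10    # students with >= 1 page-load of the product
total_page_loads = 340     # all page-load events of the product that day

pct_access = 100 * students_with_load / n_students       # percent of students reached
engagement_index = 1000 * total_page_loads / n_students  # loads per 1000 students
print(pct_access, engagement_index)  # 0.5 170.0
```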
# get shape of dataframe
print('The Shape of engagement in dataset:', engagement_dataframe.shape)
# print summary of dataframe
engagement_dataframe.info(show_counts=True)
The Shape of engagement in dataset: (22324190, 5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22324190 entries, 0 to 116881
Data columns (total 5 columns):
 #   Column            Non-Null Count     Dtype
---  ------            --------------     -----
 0   time              22324190 non-null  object
 1   lp_id             22323649 non-null  float64
 2   pct_access        22310743 non-null  float64
 3   engagement_index  16945781 non-null  float64
 4   district_id       22324190 non-null  object
dtypes: float64(3), object(2)
memory usage: 1021.9+ MB
engagement dataset information:
Q: What does the data look like for the engagement dataset?
# print first 5 rows of dataframe
engagement_dataframe.head(5)
| time | lp_id | pct_access | engagement_index | district_id | |
|---|---|---|---|---|---|
| 0 | 2020-01-01 | 57513.0 | 0.49 | 141.28 | 1052 |
| 1 | 2020-01-01 | 42091.0 | 0.04 | 0.41 | 1052 |
| 2 | 2020-01-01 | 53399.0 | 0.04 | 8.19 | 1052 |
| 3 | 2020-01-01 | 32213.0 | 0.90 | 44.23 | 1052 |
| 4 | 2020-01-01 | 85836.0 | 0.04 | 0.41 | 1052 |
Q: What is the statistics description for engagement dataset?
# print descriptive statistics for both object and numeric type
engagement_dataframe.describe(include='all')
| time | lp_id | pct_access | engagement_index | district_id | |
|---|---|---|---|---|---|
| count | 22324190 | 2.232365e+07 | 2.231074e+07 | 1.694578e+07 | 22324190 |
| unique | 366 | NaN | NaN | NaN | 233 |
| top | 2020-12-10 | NaN | NaN | NaN | 2956 |
| freq | 102833 | NaN | NaN | NaN | 290683 |
| mean | NaN | 5.470879e+04 | 5.042399e-01 | 1.676063e+02 | NaN |
| std | NaN | 2.647069e+04 | 3.180568e+00 | 1.682223e+03 | NaN |
| min | NaN | 1.000300e+04 | 0.000000e+00 | 1.000000e-02 | NaN |
| 25% | NaN | 3.085100e+04 | 0.000000e+00 | 3.700000e-01 | NaN |
| 50% | NaN | 5.500700e+04 | 2.000000e-02 | 1.920000e+00 | NaN |
| 75% | NaN | 7.766000e+04 | 9.000000e-02 | 1.365000e+01 | NaN |
| max | NaN | 9.999100e+04 | 1.000000e+02 | 2.130455e+05 | NaN |
engagement dataset data description:
Handle missing values in engagement_data:
#check for missing values
engagement_dataframe.isna().sum()
time                      0
lp_id                   541
pct_access            13447
engagement_index    5378409
district_id               0
dtype: int64
# drop rows with missing values; they are not required for this analysis
engagement_dataframe.dropna(inplace=True)
#check for missing values
engagement_dataframe.isna().sum()
time                0
lp_id               0
pct_access          0
engagement_index    0
district_id         0
dtype: int64
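Note that `dropna()` discards roughly 5.4 million rows, most of which are missing only `engagement_index`. If (and this is an assumption, not something the data dictionary confirms) a missing `engagement_index` simply reflects no recorded page loads, a gentler cleaning would keep those rows. A sketch on a toy frame with the same missingness pattern:

```python
import numpy as np
import pandas as pd

# Toy stand-in for engagement_dataframe (assumption: NaN engagement_index
# means no recorded page loads, which the data dictionary does not confirm)
df = pd.DataFrame({'lp_id': [57513.0, 42091.0, np.nan],
                   'pct_access': [0.49, 0.20, 0.10],
                   'engagement_index': [141.28, np.nan, 5.0]})

# Drop only rows that fail to identify a product...
kept = df.dropna(subset=['lp_id'])
# ...and treat a missing engagement_index as zero recorded page loads
kept = kept.assign(engagement_index=kept['engagement_index'].fillna(0.0))
print(kept.shape)  # (2, 3): rows with a valid lp_id survive
```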
#convert lp_id from float to int numeric type
engagement_dataframe['lp_id']= engagement_dataframe['lp_id'].astype(int)
# convert district_id from object type to numeric type
engagement_dataframe['district_id'] = pd.to_numeric(engagement_dataframe['district_id'])
# convert time from object type to datetime type
engagement_dataframe['time']= pd.to_datetime(engagement_dataframe['time'])
# print summary of dataframe
engagement_dataframe.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16945301 entries, 0 to 116881
Data columns (total 5 columns):
 #   Column            Non-Null Count     Dtype
---  ------            --------------     -----
 0   time              16945301 non-null  datetime64[ns]
 1   lp_id             16945301 non-null  int64
 2   pct_access        16945301 non-null  float64
 3   engagement_index  16945301 non-null  float64
 4   district_id       16945301 non-null  int64
dtypes: datetime64[ns](1), float64(2), int64(2)
memory usage: 775.7 MB
Q: What does the data look like for the engagement dataset?
# print first 5 rows of dataframe
engagement_dataframe.head(5)
| time | lp_id | pct_access | engagement_index | district_id | |
|---|---|---|---|---|---|
| 0 | 2020-01-01 | 57513 | 0.49 | 141.28 | 1052 |
| 1 | 2020-01-01 | 42091 | 0.04 | 0.41 | 1052 |
| 2 | 2020-01-01 | 53399 | 0.04 | 8.19 | 1052 |
| 3 | 2020-01-01 | 32213 | 0.90 | 44.23 | 1052 |
| 4 | 2020-01-01 | 85836 | 0.04 | 0.41 | 1052 |
engagement dataset data distribution:
Q: What is the trend for engagement_index in 2020?
# group by time without index
engagement_data_by_date11 = engagement_dataframe.groupby(by='time', as_index=False).agg('mean')
fig, ax = plt.subplots(figsize=[16, 8])
sns.lineplot(x='time', y='engagement_index', data=engagement_data_by_date11, color='#1f77b4', ax=ax)
ax.set_xlabel('DATE FOR PAGE LOAD EVENTS', size=15)
ax.set_ylabel('ENGAGEMENT INDEX', size=14)
ax.legend(labels=['ENGAGEMENT INDEX'], loc='upper left', frameon=False)
ax.set_title('ENGAGEMENT INDEX TREND', size=15)
ax.grid(True, axis='y')
plt.show()
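Daily means are noisy (school weeks produce a strong weekday/weekend cycle), so a 7-day rolling mean often makes the underlying trend easier to read. A self-contained sketch on synthetic data standing in for `engagement_data_by_date11`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch also runs in scripts
import matplotlib.pyplot as plt

# Synthetic daily engagement means for 2020 (stand-in for the real data)
rng = np.random.default_rng(0)
dates = pd.date_range('2020-01-01', '2020-12-31', freq='D')
daily = pd.DataFrame({'time': dates,
                      'engagement_index': rng.gamma(2.0, 50.0, len(dates))})

# A 7-day rolling mean smooths out the weekly cycle
daily['engagement_7d'] = daily['engagement_index'].rolling(7, min_periods=1).mean()
ax = daily.plot(x='time', y=['engagement_index', 'engagement_7d'], figsize=(16, 8))
ax.set_title('ENGAGEMENT INDEX: DAILY vs 7-DAY ROLLING MEAN')
ax.grid(True, axis='y')
plt.show()
```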
Q: What is the trend for pct_access in 2020?
fig, ax = plt.subplots(figsize=[15, 7])
sns.lineplot(x='time', y='pct_access', data=engagement_data_by_date11, color='#1f77b4', ax=ax)
ax.set_xlabel('DATE FOR PAGE LOAD EVENTS', size=15)
ax.set_ylabel('PERCENT ACCESS', size=15)
ax.legend(labels=['PERCENT ACCESS'], loc='upper left', frameon=False)
ax.set_title('PERCENT ACCESS TREND', size=15)
ax.grid(True, axis='y')
plt.show()
Q: What is the trend for engagement_index between May 2020 and Sep 2020?
# group by time with time as index
engagement_data_by_date22 = engagement_dataframe.groupby(by='time', as_index=True).agg('mean')
# trend for engagement_index between May 2020 and Sept 2020
from datetime import datetime  # for the date-window bounds below
matplotlib.rcParams['figure.figsize'] = [16, 8]
start_date = datetime(2020, 5, 1)
end_date = datetime(2020, 9, 30)
engagement_data_by_date22[(start_date<=engagement_data_by_date22.index) & (engagement_data_by_date22.index<=end_date)].plot(y='engagement_index',grid=True)
plt.show()
Q: What is the trend for pct_access between May 2020 and Sep 2020?
# trend for pct_access between May 2020 and Sept 2020
matplotlib.rcParams['figure.figsize']= [16, 8]
engagement_data_by_date22[(start_date<=engagement_data_by_date22.index) & (engagement_data_by_date22.index<=end_date)].plot(y='pct_access',grid=True)
plt.show()
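Since `engagement_data_by_date22` has a DatetimeIndex, the boolean-mask filter above can also be written with pandas partial-string slicing, which is terser. A small sketch on a synthetic series:

```python
import numpy as np
import pandas as pd

# Synthetic daily series indexed by date for 2020
idx = pd.date_range('2020-01-01', '2020-12-31', freq='D')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# .loc slicing on a DatetimeIndex is inclusive on both ends
window = s.loc['2020-05-01':'2020-09-30']
print(len(window))  # 153 days: May through September 2020
```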
Q: What is the shape of the distribution of engagement_index with respect to time?
#Checking with lag_plot for engagement_index
from pandas.plotting import lag_plot
plt.figure(figsize=(16,8))
lag_plot(engagement_data_by_date22.engagement_index)
plt.show()
Q: What is the shape of the distribution of pct_access with respect to time?
#Checking with lag_plot for pct_access
from pandas.plotting import lag_plot
plt.figure(figsize=(16,8))
lag_plot(engagement_data_by_date22.pct_access)
plt.show()
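A lag plot draws each value against its predecessor; a tight diagonal indicates strong day-to-day persistence. The lag-1 autocorrelation summarizes the same structure as a single number. A sketch on a smooth synthetic series (not the notebook's data):

```python
import numpy as np
import pandas as pd

# Smooth synthetic daily series: adjacent days are nearly identical,
# so the lag plot hugs the diagonal and lag-1 autocorrelation is near 1
idx = pd.date_range('2020-01-01', periods=200, freq='D')
s = pd.Series(np.sin(np.arange(200) / 10.0), index=idx)
print(round(s.autocorr(lag=1), 3))
```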
Handle merge of engagement data with districts info and products info:
# merge dataframe using district_id as key
learn__Platform_data = pd.merge(engagement_dataframe,districts_dataframe,on=['district_id'])
# print summary of dataframe
learn__Platform_data.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12360909 entries, 0 to 12360908
Data columns (total 11 columns):
 #   Column                    Non-Null Count     Dtype
---  ------                    --------------     -----
 0   time                      12360909 non-null  datetime64[ns]
 1   lp_id                     12360909 non-null  int64
 2   pct_access                12360909 non-null  float64
 3   engagement_index          12360909 non-null  float64
 4   district_id               12360909 non-null  int64
 5   state                     12360909 non-null  object
 6   locale                    12360909 non-null  object
 7   pct_black/hispanic        12360909 non-null  float64
 8   pct_free/reduced          12360909 non-null  float64
 9   county_connections_ratio  12360909 non-null  float64
 10  pp_total_raw              12360909 non-null  float64
dtypes: datetime64[ns](1), float64(6), int64(2), object(2)
memory usage: 1.1+ GB
# merge dataframe using lp_id as key
learn__Platform_data = pd.merge(learn__Platform_data,products_dataframe,on=['lp_id'])
# print summary of dataframe
learn__Platform_data.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6846182 entries, 0 to 6846181
Data columns (total 16 columns):
 #   Column                      Non-Null Count    Dtype
---  ------                      --------------    -----
 0   time                        6846182 non-null  datetime64[ns]
 1   lp_id                       6846182 non-null  int64
 2   pct_access                  6846182 non-null  float64
 3   engagement_index            6846182 non-null  float64
 4   district_id                 6846182 non-null  int64
 5   state                       6846182 non-null  object
 6   locale                      6846182 non-null  object
 7   pct_black/hispanic          6846182 non-null  float64
 8   pct_free/reduced            6846182 non-null  float64
 9   county_connections_ratio    6846182 non-null  float64
 10  pp_total_raw                6846182 non-null  float64
 11  URL                         6846182 non-null  object
 12  Product Name                6846182 non-null  object
 13  Provider/Company Name       6846182 non-null  object
 14  Sector(s)                   6846182 non-null  object
 15  Primary Essential Function  6846182 non-null  object
dtypes: datetime64[ns](1), float64(6), int64(2), object(7)
memory usage: 887.9+ MB
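The row count shrinks at each step (22.3M → 12.4M → 6.8M) because `pd.merge` defaults to an inner join: engagement rows whose district or product has no match are silently dropped. A left join with `indicator=True` makes those losses auditable; the toy frames below are illustrative, not the notebook's data:

```python
import pandas as pd

# Toy stand-ins for engagement and district frames
engagement = pd.DataFrame({'district_id': [1052, 2956, 7777],
                           'engagement_index': [141.28, 0.41, 8.19]})
districts = pd.DataFrame({'district_id': [1052, 2956],
                          'state': ['Illinois', 'Utah']})

# how='left' keeps every engagement row; the _merge column flags
# rows whose district_id had no match in the districts frame
merged = engagement.merge(districts, on='district_id', how='left', indicator=True)
print(merged['_merge'].value_counts())
```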
Data Definition:
D) LearnPlatform data
Q: What is the new structure of learnPlatform dataset?
| No. | Feature Name | Description of the feature |
|---|---|---|
| 01 | time | date in "YYYY-MM-DD" |
| 02 | lp_id | The unique identifier of the product |
| 03 | pct_access | Percentage of students in the district with at least one page-load event of a given product on a given day |
| 04 | engagement_index | Total page-load events per one thousand students of a given product on a given day |
| 05 | district_id | The unique identifier of the school district |
| 06 | state | The state where the district resides in |
| 07 | locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information. |
| 08 | pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| 09 | pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| 10 | county_connections_ratio | Ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households, based on county-level data from FCC Form 477 (December 2018 version). See FCC data for more information. |
| 11 | pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |
| 12 | URL | Web Link to the specific product |
| 13 | Product Name | Name of the specific product |
| 14 | Provider/Company Name | Name of the product provider |
| 15 | Sector(s) | Sector of education where the product is used |
| 16 | Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled. |
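Because the label encodes two layers (e.g. "LC - Digital Learning Platforms"), the top-level category can be split off for coarser grouping. A small sketch (not a cell from the notebook):

```python
import pandas as pd

pef = pd.Series(['LC - Digital Learning Platforms',
                 'CM - Behavior Management',
                 'SDO - Data, Analytics & Reporting'])

# Split on the first ' - ' only, then keep the left piece
top_level = pef.str.split(' - ', n=1).str[0]
print(top_level.tolist())  # ['LC', 'CM', 'SDO']
```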
Q: What does the data look like for the learnPlatform dataset?
# print first 5 rows of dataframe
learn__Platform_data.head(5)
| time | lp_id | pct_access | engagement_index | district_id | state | locale | pct_black/hispanic | pct_free/reduced | county_connections_ratio | pp_total_raw | URL | Product Name | Provider/Company Name | Sector(s) | Primary Essential Function | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | 53399 | 0.04 | 8.19 | 1052 | Illinois | Suburb | 0.3 | 0.3 | 0.59 | 17000.0 | https://spark.adobe.com/about/page | Adobe Spark Page | Adobe Inc. | PreK-12 | LC - Digital Learning Platforms |
| 1 | 2020-01-02 | 53399 | 0.12 | 25.39 | 1052 | Illinois | Suburb | 0.3 | 0.3 | 0.59 | 17000.0 | https://spark.adobe.com/about/page | Adobe Spark Page | Adobe Inc. | PreK-12 | LC - Digital Learning Platforms |
| 2 | 2020-01-03 | 53399 | 0.04 | 5.32 | 1052 | Illinois | Suburb | 0.3 | 0.3 | 0.59 | 17000.0 | https://spark.adobe.com/about/page | Adobe Spark Page | Adobe Inc. | PreK-12 | LC - Digital Learning Platforms |
| 3 | 2020-01-04 | 53399 | 0.25 | 18.02 | 1052 | Illinois | Suburb | 0.3 | 0.3 | 0.59 | 17000.0 | https://spark.adobe.com/about/page | Adobe Spark Page | Adobe Inc. | PreK-12 | LC - Digital Learning Platforms |
| 4 | 2020-01-05 | 53399 | 0.20 | 28.26 | 1052 | Illinois | Suburb | 0.3 | 0.3 | 0.59 | 17000.0 | https://spark.adobe.com/about/page | Adobe Spark Page | Adobe Inc. | PreK-12 | LC - Digital Learning Platforms |
Q: What is the overall statistics description for learnPlatform dataset?
# print descriptive statistics for both object and numeric type
learn__Platform_data.describe(include='all',datetime_is_numeric=True).transpose()
| count | unique | top | freq | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| time | 6846182 | NaN | NaN | NaN | 2020-07-07 19:09:40.855023360 | 2020-01-01 00:00:00 | 2020-04-01 00:00:00 | 2020-07-07 00:00:00 | 2020-10-17 00:00:00 | 2020-12-31 00:00:00 | NaN |
| lp_id | 6846182.0 | NaN | NaN | NaN | 54963.663806 | 10533.0 | 31827.0 | 55031.0 | 77698.0 | 99916.0 | 26615.701774 |
| pct_access | 6846182.0 | NaN | NaN | NaN | 1.037242 | 0.0 | 0.03 | 0.08 | 0.32 | 100.0 | 4.630972 |
| engagement_index | 6846182.0 | NaN | NaN | NaN | 258.319783 | 0.01 | 0.91 | 4.93 | 32.87 | 146282.09 | 2202.824126 |
| district_id | 6846182.0 | NaN | NaN | NaN | 5197.881231 | 1000.0 | 2956.0 | 4921.0 | 7660.0 | 9899.0 | 2641.791621 |
| state | 6846182 | 22 | Connecticut | 1081995 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| locale | 6846182 | 4 | Suburb | 4202589 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| pct_black/hispanic | 6846182.0 | NaN | NaN | NaN | 0.251313 | 0.1 | 0.1 | 0.1 | 0.3 | 0.9 | 0.230596 |
| pct_free/reduced | 6846182.0 | NaN | NaN | NaN | 0.340137 | 0.1 | 0.1 | 0.3 | 0.5 | 0.9 | 0.196169 |
| county_connections_ratio | 6846182.0 | NaN | NaN | NaN | 0.59016 | 0.59 | 0.59 | 0.59 | 0.59 | 1.5 | 0.012067 |
| pp_total_raw | 6846182.0 | NaN | NaN | NaN | 11774.976476 | 5000.0 | 9000.0 | 11000.0 | 13000.0 | 33000.0 | 3369.891826 |
| URL | 6846182 | 369 | http://docs.google.com/ | 53219 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Product Name | 6846182 | 369 | Google Docs | 53219 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Provider/Company Name | 6846182 | 289 | Google LLC | 842990 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sector(s) | 6846182 | 5 | PreK-12 | 2991455 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Primary Essential Function | 6846182 | 35 | LC - Digital Learning Platforms | 1665899 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
learnPlatform dataset data distribution:
Q: What is the distribution for Primary Essential Function with respect to engagement_index?
# distribution of Primary Essential Function with respect to engagement_index
learn__Platform_data.groupby('Primary Essential Function')[['engagement_index']].median().plot(kind='bar', figsize=(16, 8), color=['Green'])
plt.show()
Q: What is the distribution for Primary Essential Function with respect to pct_access?
# distribution of Primary Essential Function with respect to pct_acess
learn__Platform_data.groupby('Primary Essential Function')[['pct_access']].median().plot(kind='bar', figsize=(16, 8), color=['skyblue'])
plt.show()
Q: What is the distribution for Sector(s) with respect to engagement_index?
# distribution of Sectors with respect to engagement_index
learn__Platform_data.groupby('Sector(s)')[['engagement_index']].median().plot(kind='bar', figsize=(16, 8), color=['violet'])
plt.show()
Q: What is the distribution for Sector(s) with respect to pct_access?
# distribution of Sectors with respect to pct_access
learn__Platform_data.groupby('Sector(s)')[['pct_access']].median().plot(kind='bar', figsize=(16, 8), color=['yellow'])
plt.show()
Q: What is the correlation value between numeric variables?
#checking correlation between numeric variables (numeric_only avoids the FutureWarning)
learn__Platform_data.corr(numeric_only=True)
| lp_id | pct_access | engagement_index | district_id | pct_black/hispanic | pct_free/reduced | county_connections_ratio | pp_total_raw | |
|---|---|---|---|---|---|---|---|---|
| lp_id | 1.000000 | 0.059346 | 0.047553 | -0.004195 | -0.001915 | 0.001735 | 0.000586 | 0.001442 |
| pct_access | 0.059346 | 1.000000 | 0.733094 | -0.001161 | -0.016329 | -0.019005 | 0.007795 | 0.031840 |
| engagement_index | 0.047553 | 0.733094 | 1.000000 | 0.001762 | -0.011080 | -0.013028 | 0.001471 | 0.021871 |
| district_id | -0.004195 | -0.001161 | 0.001762 | 1.000000 | 0.068327 | -0.006337 | -0.011677 | 0.043147 |
| pct_black/hispanic | -0.001915 | -0.016329 | -0.011080 | 0.068327 | 1.000000 | 0.644704 | -0.008703 | 0.030620 |
| pct_free/reduced | 0.001735 | -0.019005 | -0.013028 | -0.006337 | 0.644704 | 1.000000 | -0.016235 | -0.055185 |
| county_connections_ratio | 0.000586 | 0.007795 | 0.001471 | -0.011677 | -0.008703 | -0.016235 | 1.000000 | -0.003050 |
| pp_total_raw | 0.001442 | 0.031840 | 0.021871 | 0.043147 | 0.030620 | -0.055185 | -0.003050 | 1.000000 |
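Because `engagement_index` is heavily right-skewed (mean 258 vs median 4.9), Pearson correlations can be dominated by the tail; Spearman rank correlation is a useful robustness check. A sketch on synthetic heavy-tailed data standing in for the real columns:

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed stand-ins for pct_access and engagement_index
rng = np.random.default_rng(1)
pct = rng.gamma(1.0, 1.0, 1000)
eng = pct * 300 + rng.gamma(1.0, 200.0, 1000)  # related but noisy
df = pd.DataFrame({'pct_access': pct, 'engagement_index': eng})

# Rank correlation is invariant to monotone transforms, so extreme
# values do not dominate the estimate
print(df.corr(method='spearman').round(2))
```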
Q: What is the correlation heatmap between numeric variables?
#checking correlation between numeric variables via heatmap
corr = learn__Platform_data.corr(method='pearson', numeric_only=True)
plt.figure(figsize=(16, 16))
sns.heatmap(corr, vmax=.9, linewidth=.01, square=True, annot=True, cmap='coolwarm', linecolor='black')
plt.show()
Q: What is the distribution for locale with respect to engagement_index?
# distribution of group by locale with respect to engagement_index
learn__Platform_data.groupby(['locale'])[['engagement_index']].median().plot(kind='bar', figsize=(16, 8),color=['lightgreen'])
plt.show()
Q: What is the distribution for locale with respect to pct_access?
# distribution of group by locale with respect to pct_access
learn__Platform_data.groupby(['locale'])[['pct_access']].median().plot(kind='bar', figsize=(16, 8),color=['blue'])
plt.show()
Q: What is the distribution for state with respect to engagement_index?
# distribution of group by state with respect to engagement_index
learn__Platform_data.groupby(['state'])[['engagement_index']].median().plot(kind='bar', figsize=(16, 8), color=['lightgreen'])
plt.show()
Q: What is the distribution for state with respect to pct_access?
# distribution of group by state with respect to pct_access
learn__Platform_data.groupby(['state'])[['pct_access']].median().plot(kind='bar', figsize=(16, 8), color=['blue'])
plt.show()
EDA Findings:
COVID-19 caused major disruption in 2020, especially during the peak case surge between May and September 2020 in the USA (COVID-19 case data). It affected every district, primarily because lockdowns closed businesses and schools.
Trend analysis of the engagement index shows a significant COVID-19 impact on digital connectivity and engagement across demographics, particularly during the May-September 2020 surge.
The effect of COVID-19 on online and distance learning was significant, and it varies across demographics, Sector(s), Primary Essential Function, and Product Name access.
By engagement index, student engagement was higher for SDO- and CM-level primary essential functions, followed by LC-level primary essential functions.
By engagement index, student engagement appears highest for the Rural locale, followed by Suburb and Town; states such as North Dakota, New York, and New Hampshire rank high on the engagement index, and these states also have a majority of Rural districts with high engagement.
This last trend is not clearly visible in the available dataset alone, but it could be examined further by comparison with public datasets on COVID-19 policies.
Highlights:
Q: What are the top 10 Product Names with respect to Engagement Index?
# Get product name sorted by engagement_index
top___product_name = learn__Platform_data.groupby(by='Product Name', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
# Set plot figure size
figure = plt.figure(figsize=[14, 9])
# Plot comparison of top 10 product names
sns.barplot(x='engagement_index', y='Product Name', data=top___product_name[0:10])
# Set plot labels
plt.xlabel('ENGAGEMENT INDEX')
plt.ylabel('PRODUCT NAME')
plt.title('TOP 10 PRODUCT NAME')
# Grid lines on the x-axis only
plt.grid(axis='x')
plt.show()
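Ranking products by the mean engagement index is sensitive to single-day spikes; re-ranking by the median is a cheap robustness check. A sketch with toy product data (not the notebook's frame):

```python
import pandas as pd

# Product A has one huge spike; product B is steadily higher
toy = pd.DataFrame({'Product Name': ['A'] * 5 + ['B'] * 5,
                    'engagement_index': [10, 10, 10, 10, 5000,
                                         40, 45, 50, 55, 60]})

by_mean = toy.groupby('Product Name')['engagement_index'].mean().sort_values(ascending=False)
by_median = toy.groupby('Product Name')['engagement_index'].median().sort_values(ascending=False)
print(by_mean.index[0], by_median.index[0])  # A B: the two rankings disagree
```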
Q: What are the top Sectors with respect to Engagement Index?
# Get sectors sorted by engagement_index (the original cell re-plotted product names)
top___sector = learn__Platform_data.groupby(by='Sector(s)', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
fig, ax = plt.subplots(figsize=[15, 10])
sns.barplot(x='engagement_index', y='Sector(s)', data=top___sector, ax=ax)
ax.set_xlabel('ENGAGEMENT INDEX')
ax.set_ylabel('SECTOR(S)')
ax.set_title('TOP SECTORS')
ax.xaxis.grid(True)  # grid lines only on the x-axis
plt.show()
Q: What are the top 10 Primary Essential Functions with respect to Engagement Index?
# Get primary essential function sorted by engagement_index
top___primary_essential = learn__Platform_data.groupby(by='Primary Essential Function', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
# Set plot figure size
fig, ax = plt.subplots(figsize=[14, 9])
# Plot comparison of top 10 primary essential functions
sns.barplot(x='engagement_index', y='Primary Essential Function', data=top___primary_essential[0:10], ax=ax)
# Set plot labels
ax.set_xlabel('ENGAGEMENT INDEX')
ax.set_ylabel('PRIMARY ESSENTIAL FUNCTION')
ax.set_title('TOP 10 PRIMARY ESSENTIAL FUNCTIONS')
# Adding grid lines only on the x-axis using Axes object
ax.xaxis.grid(True)
plt.show()
Q: What are the top Primary Essential Functions at the LC Level with respect to Engagement Index?
# Get primary essential function for LC Level sorted by engagement_index
learnPlatform__LC = learn__Platform_data[learn__Platform_data['Primary Essential Function'].str.contains("LC - ")]
top_primary__lc = learnPlatform__LC.groupby(by='Primary Essential Function', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
# Set plot figure size
figure = plt.figure(figsize=[16, 11])
# Plot comparison of top 5 primary essential functions for LC Level
sns.barplot(x='engagement_index', y='Primary Essential Function', data= top_primary__lc[0:5])
# Set plot labels
plt.xlabel('ENGAGEMENT INDEX')
plt.ylabel('PRIMARY ESSENTIAL FUNCTION')
plt.title('TOP 5 PRIMARY ESSENTIAL FUNCTIONS in LC Level')
plt.show()
Q: What are the top Primary Essential Functions at the CM Level with respect to Engagement Index?
# Get primary essential function for CM Level sorted by engagement_index
learnPlatform___CM = learn__Platform_data[learn__Platform_data['Primary Essential Function'].str.contains("CM - ")]
top_primary_cm = learnPlatform___CM.groupby(by='Primary Essential Function', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
# Set plot figure size
figure = plt.figure(figsize=[16, 8])
# Plot comparison of top 5 primary essential functions for CM Level
sns.barplot(x='engagement_index', y='Primary Essential Function', data=top_primary_cm[0:5])
# Set plot labels
plt.xlabel('ENGAGEMENT INDEX')
plt.ylabel('PRIMARY ESSENTIAL FUNCTION')
plt.title('TOP 5 PRIMARY ESSENTIAL FUNCTIONS in CM Level')
plt.show()
Q: What are the top Primary Essential Functions at the SDO Level with respect to Engagement Index?
# Get primary essential function for SDO Level sorted by engagement_index
learn__Platform_SDO = learn__Platform_data[learn__Platform_data['Primary Essential Function'].str.contains("SDO - ")]
top_primary_sdo = learn__Platform_SDO.groupby(by='Primary Essential Function', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
# Set plot figure size
figure = plt.figure(figsize=[16, 10])
# Plot comparison of top 5 primary essential functions for SDO Level
sns.barplot(x='engagement_index', y='Primary Essential Function', data=top_primary_sdo[0:5])
# Set plot labels
plt.xlabel('ENGAGEMENT INDEX')
plt.ylabel('PRIMARY ESSENTIAL FUNCTION')
plt.title('TOP 5 PRIMARY ESSENTIAL FUNCTIONS in SDO Level')
plt.show()
Q: What are the top 10 States with respect to Engagement Index?
# Get state sorted by engagement_index
top_state = learn__Platform_data.groupby(by='state', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
# Set plot figure size
figure = plt.figure(figsize=[14, 14])
# Plot comparison of top 10 states
sns.barplot(x='engagement_index', y='state', data=top_state[0:10])
# Set plot labels
plt.xlabel('ENGAGEMENT INDEX')
plt.ylabel('STATE')
plt.title('TOP 10 STATES')
plt.show()
Q: What are the top Locales with respect to Engagement Index?
# Get locale sorted by engagement_index
top_locale = learn__Platform_data.groupby(by='locale', as_index=False)['engagement_index'].agg('mean').sort_values(by='engagement_index', ascending=False)
# Set plot figure size
figure = plt.figure(figsize=[16, 7])
# Plot comparison of top locales
sns.barplot(x='engagement_index', y='locale', data=top_locale[0:4])
# Set plot labels
plt.xlabel('ENGAGEMENT INDEX')
plt.ylabel('LOCALE')
plt.title('TOP LOCALE')
plt.show()